User’s guide

scRNA seq attempts to identify the expression level of all the genes in each cell from an heterogeneous population. The starting point for to analyze snRNA-seq experiments in this application will be tables, or read count matrices, containing the number of copies detected of each type of mRNA in each individual cell.

To ilustrate each of the analysis steps we will use a 10X Genetics experiment on aproximately 10.000 peripheral blood mononuclear cells

Load Data

This app can read single cell RNA-seq data in several formats:

Click on the buttons “Select File” or “Select Folder” to choose an input dataset, and the click on “Load Data”.

There’s also the option to load additional metadata -for instance, from a previous analysis- selecting an excel file or a plaint text table (‘.csv’, ‘.tsv’, ‘.txt’) with extra cell information. The only requisite is that the cell IDs in the dataset and de selected table match. A notification will pop up when not all the cells of the dataset are included in the meta data table.

Finally, since some analysis steps, like cell type imputation, require the results of differenteial gene expression and this can be a time consuming process, there is also the option to load previously calculated DE results from an RDS or excel file.

After loading the dataset, a brief summary will appear at the bottom:

In the rest of sections of the application, the ‘Help & Info’ box at the top right will display the project name and the number of cells in the dataset:

Merging datasets

Merging or integrating datasets is a necessary step to compare and contrast cell groups minimizing possible batch effect or technical biases. In this app there are two types of data merging implemented:

Normalization is a pre-processing step for expression data that tries to remove technical variability while preserving biological differences, so the variance of a gene reflects biological heterogeneity. There are two normalization methods implemented:

QC analysis

Do a quick quality assesment of the cells in your dataset and determine thresholds to filter low quality cells. In this section you can check the distribution of read counts, number of genes per cell, the percentage of reads corresponding to mitocondrial or ribosomal RNA. In addition to that, theres also the possibility to estimate the influence of the cell cycle on the cells.

Read and Gene Counts

One of the easiest ways to assess the quality of the cells in the dataset is to take a look at the number of read counts (“nCount_RNA”) and genes (“nFeature_RNA”) detected in each one.

In addition to that, comparing gene count with the percentage of mitocondrial or ribosomal RNA (“percent.mt” and “percent.ribo”) is another simple way to pinpoint low quality cells. In most cases, high percentages of mtRNA will be caused by dying or degraded cells, issues in sample preparation, etc.

Plots of basic quality measurments, number of reads and genes per cell, and percentages of mitocondrial and ribosomal reads

As is shown in the image above, a minimum threshold of 1000 detected genes per cell will filter out most of those low-quality cells. On the other hand, the percentange of ribosomal RNA separates the cells in two groups. In situations like this, it can be interesting to check if this has any relation to cell type in the sample.

Additionally, a test with a few basic default filters will be applied, but its important to take into account that this is just an example and the particular values used could be unfit for the dataset:

Plots of basic quality measurements after automatic filtering

Cell Cycle Scoring

Cell cycle driven gene expression can have an important impact in single-cell RNA-seq experiments.

Down below are the cell cycle scoring results of a bone marrow dataset where cell cycle phase has an significant influence on the gene expression profiles of the cells (source data from Nestorowa et al., Blood 2016). In cases like this, the cells in the dataset will separate clearly by their estimated cell cycle phase.

In a case like this, cell cycle related genes will have an effect on the normalized data used for dimensional reduction and cell clustering, and subsequently on all steps of the analysis than rely on them (differential expression, determining cell types, etc). Using the cell cycle scoring function we can estimate the cycle phase of every cell based on the expression of genes liked to G2M or S phase.

In some situations there is an interest in keeping the variability due to cell cycle heterogeneity: for instance, in a developmental context, where changes in cell proliferation are related to different stages of the cells from stemness to terminal differentiation. More often than not, though, there is more interest in minimizing the influence of cell cycle phase, so using the scoring functionality will allow to regress the impact of call cycle related gene expression.

It is a good idea to come back to this section after clustering or sample merging to check if our cell groups are biased by read counts, number of detected genes, etc. Differences in mtRNA or ribosomal RNA don’t necessarily have to be linked to random noise technical bias.

In contrast to the bone marrow dataset, here are the results of cell cycle scoring for the peripheral blood mononuclear cell dataset we’ve been using as an example:

As we can see, in this case there isn’t a clear separation of the cells assigned to each phase, we would not expect a strong influence of cell cycle phase on the following steps of the analysis.

Filtering cells

In this section we can discard cells from the dataset or make subsets.

The first set of filters allows do define limits to read and gene counts, percentage of mtRNA and ribosomal RNA. Eahc time there’s a change in the dataset the values in the boxes will update, by default displaying the maximum and minimum value for each variable.

The second set of filters is for any other cell metadata that the dataset might have associated -for instance, if this is a 2nd analysis of the experiment, or if other analysis have been done-. Numerical variables will take a minimum-maximum range and categorial variables, a selection box will be deployed to select the values of interest.

This filters can be used to select cells to keep or to remove cells from the dataset. Some examples:

Any other additional metadata imported in the ‘Load data’ tab can be used as a filter -for instance, mutation or sequence variability annotations-. The last section of this tab is dedicated to variable regression and count normalization. Variable regression is done to try to compensate for the effects of a known variable in the read counts, for example: cell cycle phase, abundance of mitocondrial and ribosomal RNA, or Y chromosome expression -to avoid differential expression results being biased due to a mix of XX and XY samples-.

Normalization is a pre-processing step for expression data that tries to remove technical variability while preserving biological differences, so the variance of a gene reflects biological heterogeneity. There are two normalization methods implemented:

Explore clustering options

This step in the analysis will most of the times be the foundation for results of a scRNA-seq analysis, and so, small differences can impact greatly the conclusions gathered from the data -number of cell types in the samples, differential expression between treatments or conditions, etc-.

Grouping elements in increasing numbers of cluster. Clustering criteria will influence our conclusions about the data: there is a need to keep a balance between a cluster’s robustness and reproducibility and its capacity to be informative.
Even though it might sound counter-intuitive, very granular classifications like D can be detrimental if we don’t have enought information on the biological context to interpret the results and assess their relevance.
On the other hand, situations like A can be appropiate for exploratory analysis, but when it’s possible we should aim to uncover knowledge that is only reachable through techniques like scRNA-seq.

In this section we can explore the influence that several parameters can have in grouping cells together. Is important to take into account that even though running a clustering algorithm will separate the cells in categorical “boxes” -each of the clusters-, the underlying data is not categorical, but quantitative, and more often than not a set of clusters can be the result of dividing a continuous spectrum of cell states.

The most straightforward example of this effects are differentiation processes, in which we can have a continuous spectrum ranging from progenitors to terminally differentiated cells. In cases like this the underlying data might not have clear barriers between each of the differentiation stages, so different clustering parameters will set different borders between clusters. The plots below represent different clustering results of an scRNA-seq of 4-day embryoid bodies, with pluripotent cells on the left side and hemangiogenic progenitors on the right:

This app implements the evaluation of two main parameters: number of principal components and resolution.

Number of principal components

¿What are principal components? When we process large datasets made of many correlated variables (in our case, the counts of each gene), our first step will try to make the analysis less computationally demanding by “summarizing” our data in a more manageable number of variables.

This is what’s called dimensionality reduction, and the most common algorithm of this type is Principal Component Analysis or PCA, which consists on calculating linear combinations of the expression of several genes so that the variability of our dataset of hunderds or thousands of cells can be represented in a few dozens of variables, called principal components.

By default the PCA algorithm will generate 50 principal components, but depending on the complexity of our particular sample we will usually need less than that to characterize our dataset.

To estimate an optimal number of principal components for our dataset we can follow several criteria:

Elbow plot: Decrease in standard deviation of principal components

This can be done through visual inspection of an elbow plot representing the standard deviation of each component (example shown above) or establising a limit in the decrase rate. In orther to find this “elbow” or optimal PC number, our app will calculate at which point the fall in standard deviation is less than 5% for three components in a row. In the example plot above, the optimal number of PCs returned would be 12.

Additional measures

Another way to estimate the optimal PC number is to observe the changes in clusterning results depending on the different values of the parameter. Relying too much on 2D representations can lead to taking a decision based on biases in visual perception or preconceptions of our data -e.g.: if I expect to have 3 cell types in my sample, I might favor parameters that produce 3 groups of cells-. To help evaluate the changes in clutering we have implemented two measurements:

  • Co-clustering conservation: given a range of values for a parameter (in this example, testing the nubmer of PCs between 9-15), another way to evaluate its influece is to meassure how much the clustering changes alongside said parameter. We measure this change by the proportion of pairs of cells that keep being cluster partners. In the plot below we can see that from 9 to 10 PCs there is a large re-structuring of the clusters, but from that point onward most cells will keep the same neighbors.
  • Cluster flow plot: Cluster flow plot: we use Sankey diagrams to represent the changes in clustering. This representation allows us to determine if those changes in clustering are the result of adding more information (more PCs) or are just an artifact. We expect that relevant changes are kept even if we add more PCs, while spurious divisions will often result in unstable variations (like two clusters that keep merging and dividing). The plot below shows the snapshot of the interactive cluster flow plot for 9-19 PCs. Note the “unstable” flow in the top left corner, where a subset of the cells keeps being reorganized in 2 or 3 clusters. Even though the recommended PC number in this case is 14, in this case we see that using 15 PCs two of the clusters(14_8 and 14_10) are re-grouped differently, and this reorganization and remains almost unchanged when we add more PCs. This particular change could be of interest and represent more meaningful differences among cells, so with this dataset we could consider using 15 principal components.

Snapshot of the interactive cluster flow plot for 9-19 PCs

And here would be the corresponding 2D representations, with red arrows pointing to the two re-organized clusters:

Clustering resolution

This application uses the SNN modularity optimization based clustering algorithm implemented in the package Seurat, in which a relevant parameter is ‘resolution’. Clustering resolution will determine the granularity of the resulting cell groups, with low resolutions returning large and general clusters, and higher resolutions producing the opposite. There isn’t an optimal, objective resolution. This will depend on the dataset’s size and diversity.

In this section the resulting plots are similar to those evaluating the number of principal components, depicting co-clustering conservation and clustering flow:

Testing different resolutions from 0 to 1, we get the following co-clustering conservation plot:

Where we can see that there are major reorganizations at several points. This re-organizacions are much more clear in the cluster flow plot:

As is shown in the plot, with resolutions larger than 0.5 we start to see that “unstable re-organization” of several of the clusters as we ask for more granular divisions. This is what usually happens when the differences between cells are less categorical and more continuous, so the border between clusters end up varying a lot.

Measuring Uncertainty

The optimization phase of this clustering algorithm has a stochastic component, which means that, from the same data, we could get slighly different results each time the cells are clustered. The specific result we get each time is determined by the ‘seed’ used, usually a random number. Using a set of seeds and comparing the results is possible to examine how much variation there is among clustering results, which tells us how ambiguous or uncertain the classification of each cell is. This uncertainty score is designed to make easier to identify:

  • Oversplitting: taken to the extremes, clustering results can range from an unique cluster containing all the cells, to each cell having its own cluster. When the resolution selected is too high for the dataset, the clustering algorithm will start to find spurious boundaries that at the end of the day are not representative of meaninful differences between cell groups. A sign that this is happening are uncertainty scores being generally high across all cells.

  • Undersplitting: on the other hand, if the resolution chosen if too low, the cell clusters defined will be too general to be informative down the line, but cluster asignment will be quite robust. An example of this in our PBMC dataset would be choosing a resolution so low that we ended up with two groups: monocyte-like and lymphocyte-like cells.

  • Uncomplete datasets or multiplets: when an isolated cell is assigned a particularly high uncertainty score, it’s usually due to that cell having an expression profile in between two clusters. If we’re working with a small dataset, our issue could be that we’re missing uncommon intermediate states, and thus that cell is tenuously linked to related cell groups. On the other hand, this can also happen when a cell’s read counts are actually the result of the aggregation of two or more cells into a single droplet during cell capture, and thus the erratic cluster assignment responds to the mix of expression profiles of two cell types.

  • Continuous processes: if a particular section of the dataset presents a concetration of high uncertainty scores, we migh be dividing a cells representing a dynamic spectrum of changes in gene expression. At the start of this section we mentioned the case of differentiating cells. Another example are cells in continuous tissues, like endothelial cells forming arteries, capilaries, and veins, which commonly form adjancent groups if all three are present in the sample.

Calculating clustering uncertainty

In the image shown below we display the distribution of uncertainty scores for each cell at each resolution value tested as violin plots, while the black dots represent the mean score for each cluster. SInce this is a large dataset, uncertainty scores were calculated using 25 iterations only. By looking for local minimums -a resolution of 0.3 in this example-, these uncertainty scores can help us balance the trade offs between a classification limited enough to be robust, and detailed enough to be biologically informative. Uncertainty scores and cluster asignments at each resolution can be exported to be examined in more detail (E.g., )

Clustering

In this tab, generate a definitive clustering for the cells in the dataset. The clustering results will be stored as part of the cell meta data, which can be used in any of the other sections of the app (filtering, differential expression, cell type imputation, general plotting, etc. )

Clustering parameters can be chosen manually or determined automatically based on the dataset size, although using the tools to evaluate clustering parameters is recommended. If the parameter selection is set to fully automatic, the clustering algorithm will be run for three resolution values (low, medium, and high, with the particular values depending on dataset size)

As is explained in the last chapter, the optimization phase of this clustering algorithm has a stochastic component that we can take advantage of to measure the similarity of clustering results across several iterations given a set of parameters, named here as clustering uncertainty. If this option is selected, uncertainty scores will also be stored as cell meta data.

In our PBMC example, we will proceed setting the parameters indicated in the previous section: 15 principal components and a clustering resolution of 0.3. In addition to store the results internally, the app will also display the distribution of cluster sizes, as well as uncertainty scores if those are calculated:

To avoid the possible biases due to evaluating clustering results by the 2D visualizations the option to plot the results its deactivated by default, but this can be changed click on the “Generate 2D reduction plots” checkbox. Here are the clustering results for the PBMC dataset plotted over its UMAP reduction:

As we can see here, the most “uncertainly clustered” cells are in two groups, one situated in a transitional area, the border between cluster 0 and 1, and on the other hand, the cells in cluster 12, which is ocasionally merged with cluster 5. Additionally the limits of cluster 2 also contain a few uncertainly clustered cells

2D visualizations

UMAP and tSNE plots are the one of the most recongnizable images linked to single cell analysis. UMAP and tSNE are two non-linear dimensional reduction techniques widely use in machine learning for their capacity to generate easy to interpret 2D maps from very large datasets, which makes them extremely attractive to display highly dimensional scRNA-seq information. As it happens with clustering, dimensional reduction techniques like UMAP and tSNE have an stochastic component as well, so a particular visualization is one of the many possibile ways to position the cells.

In this tab we have two sections:

Choosing an algorithm: UMAP or tSNE?

tSNE and UMAP are the most widely used dimensionality reduction algorithms, and each algorithm produces slightly different results depending on each dataset’s particular characteristics: how many cells types we have, how large is difference between those cell types, how are the cell types related to eachother, etc. None of the analysis tools implemented in this app will use the UMAP or tSNE coordinates as input, so in this context the choice is mainly relevant as a convinient way to display analysis results. Other programs, like some used for trajectory analysis, might do use tSNE or UMAP reductions as input.

Here are two brief guides on how to interpret tSNE and UMAP reductions and what are the strengths and shortcomings of each particular algorithm.

Tuning the parameters

For UMAP or tSNE, define the most influential hyperparameters in each algorithm:

Selecting Default values will set the parameter values and random seed to the default values used in the package Seurat: * Perplexity and number of neighbors = 30 * Minimum distance = 0.3 * Seed for tSNE = 1 * Seed for UMAP = 42

Selecting Set based on dataset size for the general parameter definition will define a range of values for each parameter.

Here is an example with a set of parameter values used to generate an UMAP reduction:

And an example of 6 different random seeds using default parameter values (Number of neighbors =30, min. distance=0.3):

Differential expression

In this tab we can estimate which genes have significantlly different expression patterns between groups of cells. Cells in the dataset can be grouped by their value in one or two variables, for instance, ‘sample’ and ‘condition’, and most commonly, their assigned cluster.

In this app there are three modes of analysis implemented. The difference between them is how are the foreground and background groups defined:

Possible comparisons depending on the mode of analysis

Differential expression results will often be the basis for posterior analysis, such as like cell type imputation in this context, or any other applied to differentially expressed genes generally (functional annotation, transcription factor interaction, etc.)

DE parameters

To both speed up computation and minimize the proportion of low-relevance genes in the results, users can sete several parameters:

DE result formats

The app will display a summary with the parameters and the number of genes DE:

In our PBMC example we get the following results:

Comparison Cells.in.Foreground Cells.in.Background Total.DEG Up.regulated.Genes Down.regulated.Genes
0 VS rest 2025 18733 2197 1589 608
1 VS rest 1711 19047 2145 1738 407
2 VS rest 1448 19310 1892 321 1571
3 VS rest 1288 19470 1327 353 974
4 VS rest 1287 19471 2034 354 1680
5 VS rest 745 20013 1194 461 733
6 VS rest 724 20034 1556 384 1172
7 VS rest 588 20170 1607 1349 258
8 VS rest 473 20285 1426 702 724
9 VS rest 404 20354 564 550 14
10 VS rest 394 20364 1368 350 1018
11 VS rest 189 20569 1376 1273 103
12 VS rest 137 20621 683 244 439
13 VS rest 101 20657 2055 1794 261

Below the summary, the top differentily expressed genes are displayed. Select the one of the comparisons available and the results of the 20 genes with the smallest p-value will be shown.

For each comparison the program will generate a table to store the results. For each gene the following variables are stored:

Lastly, a heatmap will be generated using the top 5 DE genes in each cluster displaying the log-normalized expression in all the dataset. Below is an example using a subset of the PBMC dataset with just three clusters:

Dataset plots

This section of the app is dedicated to generate different types of plots from the dataset:

Gene expression

Create plots displaying log-normalized read counts for up to 4 genes. If there is a 2D reduction stored in the dataset, gene expression can be projected onto the tSNE or UMAP reductions, otherwise violin plots will be generated. To generate and store a 2D reduction of the dataset, use the tools in the 2D visualizations tab

Example of 2D projection (A) and violin plot (B) for the gene LCK in the PBMC dataset

If violin plots are selected, there also the posibility to split the violing by one or two metadata variables (E.g: ‘cluster’, ‘sample’, etc):

Cell metadata

When projecting the meta data into 2D reductions, if any variable is stored as numerical, a prompt will appear asking wheter to treat this variables as categorical (an example of this could be a ‘Sample’ variable with values ‘1’, ‘2’, ‘3’… meant to be interpreted as cualitative, not cuantitative)

Example of 2D projections of the cluster assignment (A) and uncertainty scores (B) in the PBMC dataset When generation violin or barplots, if the metadata variables selected are categorical (E.g. ‘cluster’), a barplot is created, displaying the percentage of cells in the dataset with each value. If the variables are numerical, violin plot are generated. As with gene expression plots, for numerical variables it’s possible to select additional grouping variables to split the violin plots.

Cell type imputation

Estimate cell type based on differentially expressed genes between groups of cells

There are two modes of cell type imputation available:

Automatic annotation with scCATCH

Start by selecting a grouping variable, like ‘cluster’, and the tissue type closer to your source material among the ones available in scCATCH’s internatl database.

NOTE: scCATCH’s method to calculate differential expression is not the same as the one used internally in this app, so the results might differ slightly.

A notification will pop up in the bottom right if less than 10 tissue markers are present in your dataset. For more information on how scCATCH works internally, check their publication or their GitHub repository

Here is an example of the automatic annotation done in the PBMC dataset:

Besides 2D plots, the results are avaibalbe to be exported independently in table format:

Manual annotation

The manual annotation system allows the user to introduce custom sets of genes linked to biological traits and use them to annotate the cell groups in their scRNA-seq dataset based on the differential expression of said genes. Even though this has ben implemented with cell type imputation in mind, this approach could be used with other cell properties. The recommended mode of differential expression analysis for cell type imputation is ‘1 VS rest’.

To introduce custom markers, type a cell type name and the genes that characerize it and click the ‘Add’ button. Marker sets don’t have to be exclusive, a gene can be linke to more than one cell type. As markers for cell types are added, a table display below the text boxes will be updated. If any markers added are not present in the dataset, a notification will pop up in the bottom right corner:

Cell groups will be assigned a cell type based on the differential expression of the markers selected by the user, returning the cell type whose markers are significantly up-regulated or down-regulated and have a higer mean log2(Fold Change). If none of the introduced markers are significantly up-regulated in a cluster, or the mean log2(FC) of DE genes is below the minimum log2(FC) threshold set when calculating differential expression, it will be marked as “Undetermined”.

Some considerations:

Using classical markers for peripheral blood with the PBMC dataset:

Cell.Type Markers
Naive CD4+ T IL7R, CCR7
CD14+ Monocyte CD14, LYZ
Memory CD4+ IL7R, S100A4
B cell MS4A1
CD8+ T CD8A
FCGR3A+ Monocyte FCGR3A, MS4A7
Natural killer GNLY, NKG7
Dendritic cell FCER1A, CST3
Platelet PPBP

The results in this section will consist of three elements: annotation plots, cell imputation table, and marker information table.

Annotation plots:

One plot is generated for each cell group, with a 2D reduction plot on the left side -if available- and a barplot displaying the differential expression values of each group of marker on the right. The bars in the plot are arrange to form a circle, which the minimum value in the center and the maximum in the outward perimeter, with the numeric scale at the center top of the circle. Genes whose differential expression is significant will be colored in saturated and bright colors. If differential expression is not significant, the log2(FC) will be colored in muted, unsaturated hues. Non-expressed genes are assigned a value of 0.

Example of three the resulting plots for manual annotation, in this case for clusters 0 (A), 4 (B), and 13 (C)

Cell type imputation table:

This table gathers information on the assignment of a cell type to each cluster, cointaining the following:

  • Row names reference the name of the comparison made in the differential expression analysis.
  • Group: references the foreground cluster in the differential expression analysis.
  • Size: number of cells in the cluster.
  • DE analysis mode: one of ‘1 VS rest’, ‘1 VS 1’, or ‘Conditional’, depending on the paremeters used in the differential expression analysis.
  • Num of DEGs: total number of significantly differentially expressed genes in the cluster
  • ‘Num of up genes’ and ‘Num of down genes’: number of differentially expressed genes up-regulated (log2(FC) over the minimum threshold) and down-regulated (log2(FC) below the minimum threshold).
  • ‘Markers UpReg’ and ‘Markers DownReg’: subset of the cell type markers introduced that are significantly up-regulated or down-regulated in the cluster.
  • Putative cell type: cell type assigned to the cluster.
Group Size DE analysis mode Num of DEGs Num of up genes Num of down genes Markers UpReg Markers DownReg Putative cell type
0 VS rest 0 2025 1 VS rest 2197 1589 608 CD14, LYZ, S100A4, CST3 IL7R, CCR7, IL7R CD14+ Monocyte
1 VS rest 1 1711 1 VS rest 2145 1738 407 CD14, LYZ, S100A4, MS4A7, CST3 IL7R, CCR7, IL7R, FCGR3A CD14+ Monocyte
2 VS rest 2 1448 1 VS rest 1892 321 1571 IL7R, CCR7, IL7R CD14, LYZ, S100A4, MS4A7, NKG7, CST3 Naive CD4+ T
3 VS rest 3 1288 1 VS rest 1327 353 974 IL7R, IL7R CD14, LYZ, MS4A7, CST3 Memory CD4+/Naive CD4+ T
4 VS rest 4 1287 1 VS rest 2034 354 1680 IL7R, CCR7, IL7R, CD8A CD14, LYZ, S100A4, MS4A7, CST3 CD8+ T
5 VS rest 5 745 1 VS rest 1194 461 733 CD8A, FCGR3A, GNLY, NKG7 CCR7, CD14, LYZ, MS4A7, CST3 CD8+ T
6 VS rest 6 724 1 VS rest 1556 384 1172 MS4A1 IL7R, CD14, LYZ, IL7R, S100A4, MS4A7, CST3 B cell
7 VS rest 7 588 1 VS rest 1607 1349 258 S100A4, FCGR3A, MS4A7, CST3 IL7R, CCR7, CD14, LYZ, IL7R FCGR3A+ Monocyte
8 VS rest 8 473 1 VS rest 1426 702 724 FCGR3A, GNLY, NKG7 IL7R, CCR7, CD14, LYZ, IL7R, MS4A7, CST3 Natural killer
9 VS rest 9 404 1 VS rest 564 550 14 CD14, LYZ GNLY, NKG7 CD14+ Monocyte
10 VS rest 10 394 1 VS rest 1368 350 1018 MS4A1 IL7R, CD14, LYZ, IL7R, S100A4, MS4A7, CST3 B cell
11 VS rest 11 189 1 VS rest 1376 1273 103 LYZ, S100A4, FCER1A, CST3 IL7R, CCR7, IL7R Dendritic cell
12 VS rest 12 137 1 VS rest 683 244 439 IL7R, IL7R, CD8A, GNLY, NKG7 CCR7, CD14, LYZ, MS4A7, CST3 Natural killer
13 VS rest 13 101 1 VS rest 2055 1794 261 FCER1A, CST3 IL7R, CD14, LYZ, IL7R, S100A4, MS4A7 Dendritic cell

Marker information:

This table has a brief summary on the cell type markers introduced and their expression in the dataset, which allows to detect if any gene might not be expressed in enough quantity to function as a cell type marker. The table includes the following fields:

  • Gene: gene ID introduced
  • Cell type: cell type the gene is linked to
  • Pct cells expressing: percentage of cells in the dataset in which this gene has a read count over zero.
  • mean CPM in expressing cells: of the subset of cells that express this gene, mean counts per million reads for this gene.
Gene Cell type Pct cells expressing mean CPM in expressing cells
IL7R Naive CD4+ T 41.51 1171.18
CCR7 Naive CD4+ T 31.27 373.42
CD14 CD14+ Monocyte 37.43 583.55
LYZ CD14+ Monocyte 59.69 8554.46
IL7R Memory CD4+ 41.51 1171.18
S100A4 Memory CD4+ 85.21 2551.64
MS4A1 B cell 11.40 1162.62
CD8A CD8+ T 17.00 361.51
FCGR3A FCGR3A+ Monocyte 20.79 721.48
MS4A7 FCGR3A+ Monocyte 29.79 420.53
GNLY Natural killer 15.19 5699.93
NKG7 Natural killer 22.61 2517.98
FCER1A Dendritic cell 1.93 403.96
CST3 Dendritic cell 49.20 2068.29
PPBP Platelet 1.09 3134.86

Altered pathways

The aim of this section is to find which metabolic pathways might affected by the differential expression observed. To that end, we will examine if there is an overrepresentation of differentially expressed genes among elements linke to a particular pathway.

The app has stored a collection of gene sets from several pathway databases (KEGG, Biocarta, etc…) representing a wide array of molecular pathways. Here we will use the results from the differential expression analysis section to calculate if there is an over-representation of DEGs in each pathway.

The process goes as follows: * Get a list of the genes expressed in the cells tested. For each DE analysis, the first step will be to get a list of the genes that could be considered as “expressed” among the cells in which DE was tested. For this, we’ll take the same threshold as in differential expression analysis: by deafult 25% of the cells analyzed must have read counts > 0 for a gene to be considered “expressed”. * Generate contingency tables. The next step is to count how many of those expressed genes are differentially expressed, and for each pathway gene set, how many genes are part of the pathway:

Are the genes part of this pathway’s gene set?
Yes No
Are the genes Yes a b
differentially expressed? No c d

A contingency table will be made for each pathway gene set. From this contingency table we can calculate an Odds Ratio as (a/b) / (c/d), which quantifies the level of over-representation that differentially expressed genes have in that gene set, and a p-value using Fisher’s exact test. P-values will afterwards be correcter for multiple testing by FDR. * Odds Ratios > 1 indicate enrichment of DEGs in that pathway * Odds Ratios < 1 indicate depletion of DEGs in that pathway

Interpreting the results

After running the analysis, a summary will be displayed in the screen, and the detailed results will be available in excel format (.xlsx). A table of result will be generated for each differential expression analysis selected. The tables in the full results contain:

There’s an example below of the Likert type plots generated to ilustrate DEG enrichment in metabolic pathways. In this case we used the data of endothelial cells from an experiment on lung samples from mice exposed or not to intermitent hypoxia, so after processing the dataset our cells are divided by two variables: treatment and cluster. The four plots below represent:

As a general rule, differential expression between different clusters will result in larger numbers of differentially expressed genes than if we test cells of the same cluster separated by another criteria (treatment, sample, etc), so it’s expected for conditional differential expression analysis to also yield less significant enrichment in pathway analysis as well.

Citations

This application implements tools for single cell RNA-seq analysis published in the following R packages:

Experiments used in this guide:

In addition, this application makes use of the following resources: * MSigDB v7.5.1 collection of genesets characterizing metabolic pathways annotated in public databases. such as: * Biocarta * KEGG pathways * Pathway Interation Database (PID) * Reactome * WikiPathways